April 4, 2019

Last class

  • Intro to functional genomics assays to study transcriptional regulation and epigenetics
    • ChIP-seq and beyond
  • R Bioconductor infrastructure for analysis of genomic data

Today's topics

  • Practical steps for analysis of ChIP-seq
    • Obtain the data
    • Align reads to the genome
    • Call peaks
    • Identify reproducible peaks between replicates
    • Find motifs

Obtain the data

Sources of genomic data

  • Public repositories
    • Gene Expression Omnibus (GEO)
    • Sequence Read Archive (SRA)
  • Consortia
    • ENCODE
    • Roadmap Epigenomics
    • etc.
  • Your own data
    • Integrated Genomics Operation (IGO)

GEO

SRA

ENCODE

Examples for today

Obtain data from GEO and SRA

  • SRA toolkit
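
A minimal sketch of how the SRA toolkit steps might be assembled for one run. It assumes the toolkit's prefetch and fasterq-dump programs are installed and that the run is single-end; the function name, output directory, and thread count are illustrative choices, not part of the original slides. The commands are only printed here, not executed.

```python
import shlex

def sra_download_commands(accession, outdir="fastq", threads=4):
    """Return the commands to fetch one SRA run and convert it to gzipped FASTQ.

    Assumes a single-end run, for which fasterq-dump writes <accession>.fastq.
    """
    return [
        # Download the run from SRA into the local cache.
        ["prefetch", accession],
        # Convert the downloaded run to FASTQ.
        ["fasterq-dump", accession, "--outdir", outdir, "--threads", str(threads)],
        # fasterq-dump writes uncompressed FASTQ, so compress it afterwards.
        ["gzip", f"{outdir}/{accession}.fastq"],
    ]

commands = sra_download_commands("SRR1186971")
for command in commands:
    print(" ".join(shlex.quote(part) for part in command))
```

Each command list could be passed to subprocess.run once the tools are available.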

Read alignment

Sequencing-based assays

Read alignment

  • What is sequence alignment?
    • Classic problem in bioinformatics
    • Align (DNA, protein) sequences for evolutionary analysis
    • BLAST
  • Tools to align millions of short reads to large genomes
    • Bowtie2, BWA, STAR

Bowtie2

Align reads

  • Align reads from one sample:
bowtie2 -p 10 -t --no-unal -X 500 -x mm10-bowtie2index/mm10 -U SRR1186971.fastq.gz >SRR1186971.sam 2>.bowtie2.SRR1186971
  • Bowtie2 output:
Time loading reference: 00:00:00
Time loading forward index: 00:00:01
Time loading mirror index: 00:00:00
Multiseed full-index search: 00:04:06
24505004 reads; of these:
  24505004 (100.00%) were unpaired; of these:
    2832435 (11.56%) aligned 0 times
    13902816 (56.73%) aligned exactly 1 time
    7769753 (31.71%) aligned >1 times
88.44% overall alignment rate
Time searching: 00:04:08
Overall time: 00:04:08
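
The summary above can be checked programmatically, which helps when many samples are aligned. The sketch below parses the single-end summary lines shown on this slide and recomputes the rates; the function name and regular expressions are illustrative and assume the single-end log layout shown here (paired-end logs are laid out differently).

```python
import re

def parse_bowtie2_summary(log_text):
    """Extract read counts from a Bowtie2 single-end alignment summary."""
    patterns = {
        "total": r"^\s*(\d+) reads; of these:",
        "unaligned": r"^\s*(\d+) \([\d.]+%\) aligned 0 times",
        "unique": r"^\s*(\d+) \([\d.]+%\) aligned exactly 1 time",
        "multi": r"^\s*(\d+) \([\d.]+%\) aligned >1 times",
    }
    counts = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log_text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"could not find the '{key}' line in the log")
        counts[key] = int(match.group(1))
    return counts

# Summary lines copied from the Bowtie2 output on this slide.
log = """24505004 reads; of these:
  24505004 (100.00%) were unpaired; of these:
    2832435 (11.56%) aligned 0 times
    13902816 (56.73%) aligned exactly 1 time
    7769753 (31.71%) aligned >1 times
88.44% overall alignment rate
"""

counts = parse_bowtie2_summary(log)
aligned = counts["unique"] + counts["multi"]
overall_rate = 100 * aligned / counts["total"]
unique_rate = 100 * counts["unique"] / counts["total"]
print(f"overall alignment rate: {overall_rate:.2f}%")
print(f"uniquely aligned: {unique_rate:.2f}%")
```

The overall rate is the sum of the uniquely and multiply aligned reads divided by the total, so roughly a third of the reads in this sample map to more than one location.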

Alignment files

  • A common pattern in bioinformatics: data are stored in tab-delimited text files and manipulated with command-line tools
    • samtools, bedtools, deepTools, etc.
  • Sequence Alignment/Map (SAM), with BAM as its compressed binary form
    • SAMtools to view, sort, index, and convert SAM/BAM files
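
To make the SAM layout concrete, here is a small sketch that splits one alignment line into its 11 mandatory tab-delimited fields and decodes a few bits of the FLAG field. The example record is made up for illustration (it is not taken from the data above), and the names SamRecord, parse_sam_line, and describe_flag are hypothetical helpers, not part of SAMtools.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)

def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]),          # 1-based leftmost mapping position
        mapq=int(fields[4]), cigar=fields[5], rnext=fields[6],
        pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10],
        tags=fields[11:],            # optional TAG:TYPE:VALUE fields
    )

def describe_flag(flag):
    """Decode a few commonly used bits of the SAM FLAG field."""
    return {
        "unmapped": bool(flag & 0x4),
        "reverse_strand": bool(flag & 0x10),
        "secondary": bool(flag & 0x100),
    }

# Hypothetical record: a 36 bp read aligned to the reverse strand of chr1.
example = "\t".join(
    ["read1", "16", "chr1", "3000100", "42", "36M", "*", "0", "0",
     "A" * 36, "I" * 36, "NM:i:0"]
)
record = parse_sam_line(example)
print(record.rname, record.pos, describe_flag(record.flag))
```

In practice a library such as pysam or the samtools command line would be used instead of hand-written parsing; the sketch only shows what the columns contain.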

FastQC

  • Tools for summary statistics and quality metrics
    • FastQC
    • samtools stats
    • MultiQC

Align reads from all samples
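
One way to extend the single-sample command to every sample is to generate the same command line per accession. The sketch below only builds and prints the commands; the helper name is illustrative, and the second accession is a placeholder, not a real run from this dataset.

```python
import shlex

def bowtie2_command(accession, index="mm10-bowtie2index/mm10", threads=10):
    """Build the single-end Bowtie2 command used earlier for one accession."""
    args = [
        "bowtie2", "-p", str(threads), "-t", "--no-unal", "-X", "500",
        "-x", index, "-U", f"{accession}.fastq.gz",
    ]
    command = " ".join(shlex.quote(arg) for arg in args)
    sam_file = shlex.quote(f"{accession}.sam")
    log_file = shlex.quote(f".bowtie2.{accession}")
    # Alignments go to stdout (the SAM file); the summary goes to stderr (the log).
    return f"{command} >{sam_file} 2>{log_file}"

# SRR1186971 is the sample from the earlier slide; SRR0000000 is a placeholder.
samples = ["SRR1186971", "SRR0000000"]
all_commands = [bowtie2_command(sample) for sample in samples]
for command in all_commands:
    print(command)
```

The printed lines could be written to a shell script or submitted as separate cluster jobs.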

Peak calling

Peaks of read coverage for ChIP-seq, etc.

Peak calling

  • Considerations
    • control: input DNA, non-specific antibody, genetic controls
    • paired-end or single-end sequencing
  • Tools
    • MACS2
    • SPP
  • Reproducible peaks (ENCODE standard)
    • Irreproducible Discovery Rate (IDR)
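
As a rough illustration of what "reproducible between replicates" means, the sketch below keeps peaks from one replicate that overlap any peak in the other. This is a naive overlap filter, not the IDR method: IDR ranks peaks by signal in each replicate and fits a statistical model to the consistency of those ranks. The peak coordinates are made up.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def peaks_shared_with(rep1, rep2):
    """Return the peaks in rep1 that overlap at least one peak in rep2."""
    return [peak for peak in rep1 if any(overlaps(peak, other) for other in rep2)]

# Hypothetical peaks in BED-style coordinates (0-based start, exclusive end).
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
rep2 = [("chr1", 150, 250), ("chr1", 600, 700), ("chr2", 300, 400)]

shared = peaks_shared_with(rep1, rep2)
print(shared)
```

The quadratic scan is fine for a toy example; for real peak sets, bedtools intersect or an interval tree would be the usual choice.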

MACS2 for peak calling

macs2 callpeak -t foxp3_rep1.bam -c foxp3_input.bam -f BAM -n foxp3_rep1 -B --SPMR --outdir peaks-macs2/foxp3_rep1/ -g mm -p 0.1 --keep-dup 'auto' --call-summits 2>.macs
  • Peaks output in narrowPeak format (extended BED)
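
The narrowPeak format adds four columns to the first six BED columns: signal value, -log10 p-value, -log10 q-value, and the summit offset from the peak start. A minimal parsing sketch follows; the example line is invented for illustration rather than taken from real MACS2 output, and the class and function names are hypothetical.

```python
from typing import NamedTuple

class NarrowPeak(NamedTuple):
    chrom: str
    start: int               # 0-based start
    end: int                 # exclusive end
    name: str
    score: int
    strand: str
    signal_value: float
    neg_log10_pvalue: float
    neg_log10_qvalue: float
    summit_offset: int       # offset of the summit from start, -1 if not called

def parse_narrowpeak_line(line):
    """Parse one 10-column narrowPeak line into a NarrowPeak record."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 10:
        raise ValueError("narrowPeak lines have exactly 10 columns")
    return NarrowPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                      float(f[6]), float(f[7]), float(f[8]), int(f[9]))

def summit_position(peak):
    """Genomic coordinate of the summit, or None if no summit was called."""
    if peak.summit_offset < 0:
        return None
    return peak.start + peak.summit_offset

# Hypothetical peak line.
line = "chr1\t3000000\t3000400\tfoxp3_rep1_peak_1\t87\t.\t5.2\t8.7\t6.1\t150"
peak = parse_narrowpeak_line(line)
print(peak.chrom, summit_position(peak), peak.end - peak.start)
```

Because the p-value column is stored as -log10(p), a threshold such as p < 1e-5 corresponds to keeping peaks whose column 8 value exceeds 5.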

IDR tool

Motif analysis

Binding specificity

  • (Some) proteins bind to DNA in a sequence-specific manner

  • Motif to represent and model sequence specificity
    • Position Weight Matrix (PWM)
    • Motif logo
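
A sketch of how a PWM can be built and used, assuming a set of aligned binding sites of equal length, a uniform background of 0.25 per base, and a pseudocount of 1. The sites below are made up for illustration, and the function names are hypothetical. Each entry is the log2 ratio of the observed base frequency at that position to the background frequency.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (list of per-position dicts) from aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def score(pwm, window):
    """Sum the log-odds of the observed base at each motif position."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_match(pwm, sequence):
    """Return (score, start) of the highest-scoring window on the given strand."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical aligned binding sites.
sites = ["TGTTTAC", "TGTTTGC", "TATTTAC", "TGTTTAT"]
pwm = build_pwm(sites)
best_score, best_start = best_match(pwm, "CCCTGTTTACGGG")
print(best_start, round(best_score, 2))
```

A real motif scan would also score the reverse complement and compare scores against a threshold or a background distribution.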

Visualize ChIP-seq

Genome browsers

  • Integrative Genomics Viewer (IGV)

  • UCSC Genome Browser

IGV

Read coverage tracks

  • File formats
    • bedGraph
    • wig
    • bigWig
  • Tools to generate bigWig
    • bedtools genomecov, followed by the UCSC bedGraphToBigWig utility to convert bedGraph to bigWig
    • deepTools (bamCoverage)
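
To show what a coverage track contains, the sketch below computes bedGraph-style runs of constant read depth from a handful of aligned read intervals on one chromosome. It illustrates the idea behind coverage tools, not their implementation; the read intervals and the function name are hypothetical.

```python
def coverage_runs(intervals):
    """Return (start, end, depth) runs with depth > 0 for half-open intervals."""
    # Record +1 where a read starts and -1 where it ends.
    changes = {}
    for start, end in intervals:
        changes[start] = changes.get(start, 0) + 1
        changes[end] = changes.get(end, 0) - 1

    runs = []
    depth = 0
    previous = None
    for position in sorted(changes):
        if previous is not None and depth > 0:
            if runs and runs[-1][1] == previous and runs[-1][2] == depth:
                # Extend the previous run when the depth did not change.
                runs[-1] = (runs[-1][0], position, depth)
            else:
                runs.append((previous, position, depth))
        depth += changes[position]
        previous = position
    return runs

# Hypothetical reads on chr1 (0-based start, exclusive end).
reads = [(0, 10), (5, 15), (20, 30)]
runs = coverage_runs(reads)
for start, end, depth in runs:
    # bedGraph columns: chrom, start, end, value
    print(f"chr1\t{start}\t{end}\t{depth}")
```

Regions with zero coverage are simply omitted, which keeps the text file compact; bigWig stores the same information in an indexed binary form that genome browsers can read efficiently.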